KEP-2857: Runtime Assisted Mounting of Persistent Volumes #2893
Conversation
ddebroy
commented
Aug 24, 2021
- One-line PR description: Provisional KEP for Runtime Assisted Mounting of Persistent Volumes
- Issue link: Runtime assisted mounting of Persistent Volumes #2857
- Other comments: Initial KEP PR
Force-pushed from 986ace7 to e4721ee
on a persistent volume, it is not expected to support any deferred file system
management operations since the runtime will not own a file system mount to manage.

A new CSI service: `Runtime` is proposed to coordinate the file system management
Do you mean a gRPC service here? `Runtime` is going to be implemented by a container runtime, not a CSI driver, right?
Correct - a gRPC service that will be implemented by a container runtime and not a CSI Node Plugin. Will clarify that.
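To make the clarification concrete, here is a minimal sketch in Go of what such a runtime-implemented service could look like; the method names follow the `RuntimeStageVolume`/`RuntimeExpandVolume` RPCs excerpted later in this conversation, while the request/response fields are placeholders, not the KEP's actual definitions.

```go
// Hypothetical sketch of the service the container runtime handler (not the
// CSI node plugin) would implement and expose over a domain socket.
package runtimeservice

import "context"

// RuntimeStageVolumeRequest carries the details the CSI plugin would otherwise
// use to mount the filesystem itself (illustrative fields only).
type RuntimeStageVolumeRequest struct {
	VolumeID string
	Device   string   // e.g. /dev/vdb
	FSType   string   // e.g. ext4, xfs
	Options  []string // mount options
}

type RuntimeStageVolumeResponse struct{}

type RuntimeExpandVolumeRequest struct {
	VolumeID string
}

type RuntimeExpandVolumeResponse struct{}

// RuntimeService is served by the container runtime handler and invoked to
// perform deferred filesystem operations on its side of the boundary.
type RuntimeService interface {
	RuntimeStageVolume(ctx context.Context, req *RuntimeStageVolumeRequest) (*RuntimeStageVolumeResponse, error)
	RuntimeExpandVolume(ctx context.Context, req *RuntimeExpandVolumeRequest) (*RuntimeExpandVolumeResponse, error)
}
```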
should be approved by the remaining approvers and/or the owning SIG (or
SIG Architecture for cross-cutting KEPs).
-->
# KEP-2857: Runtime Assisted Mounting of Persistent Volumes
"generic ephemeral volumes" also would be handled this way. It would be better to keep "persistent" out of the KEP name.
As this is designed to work only with volumes provided by a CSI driver, perhaps replace "Persistent" with "CSI"?
volumes to the container runtime is a basic requirement in the context
of this KEP. An overview of the enhancements necessary to support this:
- A new field in RuntimeClass to indicate a handler/runtime can support mount
  deferral.
It's worth calling out here early that `NodeStageVolume` cannot be used when a CSI driver wants to support deferred mounts.
stage the volume (ensuring it is formatted) and mount the filesystem
(associated with the PV) in the node host OS environment. This stage happens
irrespective of the runtime's capabilities indicated in the subsequent CSI
NodePublishVolume call.
What's the advantage of putting this work into `NodeStageVolume`? Why not skip the call and do the work in `NodePublishVolume`?
I will put a note clarifying ^ is the preferred approach.
// SupportsFSMounts indicates whether the Handler supports mounting of
// FileSystem for persistent volumes referred to by the pod
// Default False
SupportsFSMounts bool `json:"supportsFSMounts,omitempty" protobuf:"bytes,5,opt,name=supportsFSMounts"`
API reviewers will know this better, but I suspect this might have to be a pointer to handle cases where the client doesn't support the field (set vs. not set).
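For illustration, a sketch of the pointer form suggested here, so that an unset value can be told apart from an explicit `false`; the wrapper type is hypothetical and the field would live on RuntimeClass in the real API.

```go
// Hypothetical container for the proposed field; shown only to illustrate the
// pointer suggestion above.
package sketch

type runtimeClassMountFields struct {
	// SupportsFSMounts indicates whether the Handler supports mounting of
	// filesystems for persistent volumes referred to by the pod.
	// nil means the client did not set the field (distinct from false).
	// +optional
	SupportsFSMounts *bool `json:"supportsFSMounts,omitempty" protobuf:"bytes,5,opt,name=supportsFSMounts"`
}
```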
// by the Handler over which FileSystem Management APIs can be invoked. Should
// not be specified if Handler does not support FileSystem Management APIs.
// +optional
FSMgmtSocket *string `json:"fsMgmtSocket,omitempty" protobuf:"bytes,6,opt,name=fsMgmtSocket"`
This assumes that the path is the same for all nodes in the cluster, doesn't it? Can that be guaranteed?
Good point. I will add a recommendation to use separate RuntimeClass specs with Scheduling (https://kubernetes.io/docs/concepts/containers/runtime-class/#scheduling) to target the different classes of nodes that share the same runtime but need different configuration for the domain socket path, and to specify the corresponding domain socket path in each RuntimeClass.
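A rough sketch, under the assumptions in this reply, of two RuntimeClass objects sharing the same handler but targeting different node classes via Scheduling; the label key, socket paths, and the annotation standing in for the proposed `fsMgmtSocket` field are all made up.

```go
package main

import (
	"fmt"

	nodev1 "k8s.io/api/node/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// kataClassFor builds a RuntimeClass for one class of nodes. The annotation is
// a placeholder for the proposed fsMgmtSocket field (per the Alpha-phase
// annotation idea quoted below).
func kataClassFor(nodeClass, socketPath string) nodev1.RuntimeClass {
	return nodev1.RuntimeClass{
		ObjectMeta: metav1.ObjectMeta{
			Name:        "kata-" + nodeClass,
			Annotations: map[string]string{"example.com/fs-mgmt-socket": socketPath},
		},
		Handler: "kata",
		Scheduling: &nodev1.Scheduling{
			// Pods using this RuntimeClass only schedule onto nodes where the
			// runtime exposes its socket at socketPath.
			NodeSelector: map[string]string{"example.com/node-class": nodeClass},
		},
	}
}

func main() {
	a := kataClassFor("a", "/run/kata/fs-mgmt-a.sock")
	b := kataClassFor("b", "/run/kata/fs-mgmt-b.sock")
	fmt.Println(a.Name, b.Name)
}
```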
```

During the Alpha phase, the above fields can be introduced as annotations on
RuntimeClass.
Using annotations instead of proper fields has fallen out of favor as far as I know.
// network share when SP defers file system mounts to a container runtime.
// An SP MUST populate this if runtime_supports_mount was set in
// NodePublishVolumeRequest and the SP is capable of deferring filesystem mount to
// the container runtime.
A new capability needs to be defined. Otherwise kubelet won't know whether the SP checks `runtime_supports_mount` and sets this field.
Note that the OCI spec already allows the `type` and `options` fields to be
specified as part of the `mount` field. Therefore, no enhancements should be
necessary in the OCI spec. OCI runtimes supporting mount deferral need to be
able to execute the filesystem mount with the details specified in `mounts`.
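For example, a minimal sketch using the OCI runtime-spec Go types shows how such a mount is already expressible with the existing fields; the destination, device path, and options are made-up values.

```go
package main

import (
	"encoding/json"
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	// A filesystem mount on a block device, expressed with the existing OCI
	// spec mount fields (no spec changes needed).
	m := specs.Mount{
		Destination: "/var/lib/app-data", // mount point inside the container
		Type:        "ext4",              // filesystem type the runtime must be able to mount
		Source:      "/dev/vdb",          // hypothetical block device handed over by the CSI plugin
		Options:     []string{"rw", "noatime"},
	}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out))
}
```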
How can this be checked? Do users have to ensure that they don't ask for something that isn't supported, like creating a volume with XFS as filesystem and then using that with a runtime that doesn't support XFS?
- The node CSI plugin receives a CSI `NodePublishVolume` call. In response, the
CSI plugin will unmount the file system (if mounted as part of stage) and pass
the block device path (rather than a file system mount point) along with file-system
type and mount options to the Kubelet.
I think this mounting in NodeStage and unmounting in NodePublish is just a waste of resources. A driver should announce its capability to "provide a block volume for filesystem PV" and kubelet should do the right thing right away.
specifying details of how and what to mount on a block device or network share.
```
message FileSystemMountInfo {
  // Source device or network share (e.g. /dev/sdX, srv:/export) whose mount
The whole point of CSI is to hide all storage details from Kubelet and the host OS into CSI drivers. If we want a micro VM to mount a generic CSI volume, then it needs to support all possible filesystems that any CSI driver may use, for example CephFS, CIFS, GlusterFS and NFS, mentioning just the major open source ones.
- How does kubelet know what filesystems the container runtime / VM supports? What if the VM does not have utils to mount GlusterFS, but a pod uses a Gluster PV? Or kernel module for zfs?
- Some CSI drivers (say Gluster or CephFS) may need a config file to mount a filesystem or even get credentials from a Secret. If a VM had utils for these filesystems, how does it get the config + secrets?
- Some proprietary CSI drivers need a kernel module or fuse daemon. Are they supposed to run in the VM?
It is possible to solve all the above, but not without leaking the storage details from the CSI driver either back to kubelet or to the micro VM / its OS / its micro kubelet (or whatever it runs to mount volumes).
Could the micro VM run the CSI driver in the end?
Another possibility would be to limit this proposal to volumes based on block devices. FSType is already quite exposed in Kubernetes API and everyone sort of supports ext4 and xfs. Kubelet could know that a CSI driver supports "block devices for filesystem volumes" capability and it could know that pod's RuntimeClass can mount them (with a list of supported filesystems, say xfs and ext4). Kubelet then could call NodePublish (or even some new CSI call) to get a "block device for a filesystem PV" and give it to CRI to mount. For PVs that use filesystems not supported by the VM (say zfs) or don't support block devices (NFS, CephFS, ...) or don't support the new capability (all legacy CSI drivers in the beginning), virtio-fs would be used as it is today.
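A rough sketch of the fallback logic this comment proposes, with entirely made-up names (`driverSupportsBlockForFS` and `supportedFSTypes` stand in for the suggested CSI capability and per-RuntimeClass filesystem list): defer only when both sides support it, otherwise keep today's virtio-fs path.

```go
package main

import "fmt"

type volumeInfo struct {
	driverSupportsBlockForFS bool   // stand-in for the suggested CSI capability
	fsType                   string // e.g. "ext4", "xfs", "nfs"
}

type runtimeInfo struct {
	supportedFSTypes []string // stand-in for a per-RuntimeClass list
}

// useDeferredMount decides whether kubelet would hand the runtime a block
// device to mount itself, or fall back to the existing virtio-fs path.
func useDeferredMount(v volumeInfo, r runtimeInfo) bool {
	if !v.driverSupportsBlockForFS {
		return false
	}
	for _, fs := range r.supportedFSTypes {
		if fs == v.fsType {
			return true
		}
	}
	return false
}

func main() {
	rt := runtimeInfo{supportedFSTypes: []string{"ext4", "xfs"}}
	fmt.Println(useDeferredMount(volumeInfo{true, "ext4"}, rt))  // true: runtime mounts the block device
	fmt.Println(useDeferredMount(volumeInfo{true, "nfs"}, rt))   // false: fall back to virtio-fs
	fmt.Println(useDeferredMount(volumeInfo{false, "xfs"}, rt))  // false: legacy driver
}
```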
Totally agree about the leaking of storage details to cover the broad range of scenarios handled by CSI. I like the suggestion of limiting this proposal to block devices with a list of supported file systems (and explaining that in Limitations/Caveats). I was wondering if you see any major challenges with basic NFS support?
I would be very conservative about adding one-off support for some filesystems, leaving the others unsupported. Even asking drivers to unify on ext4/xfs support is a pretty big requirement. To me it seems we're moving away from CSI ("do whatever you want in your driver, as long as there is a filesystem mounted in the end") to a more strict environment, where a storage backend vendor is forced to implement their plugin in a limited way. We're abandoning all the new fancy storage projects and companies here.
This proposal is not advocating that certain plugin capabilities be turned off in any way or that the existing virtiofs path be disabled. It's completely up to the plugin to decide whether it wants to take part in the deferred mount path. If the plugin wants to offer certain capabilities/file systems that a microvm based runtime environment cannot support, it is fine for things to fall back to the regular/existing virtiofs based path if microvm support is desired. I can emphasize in the KEP overview sections that the fall-back path is okay.
CSI plugin will unmount the file system (if mounted as part of stage) and pass
the block device path (rather than a file system mount point) along with file-system
type and mount options to the Kubelet.
- The kubelet passes the block device, file system type and mount options (from
How would it pass potential NodePublish secrets? They're allowed in CSI and used by a few CSI drivers.
If the secret is necessary for actions prior to the FS mount step, the CSI plugin is expected to consume the secrets and use them. For example, if a block based CSI plugin layers a dm-crypt device on top of the underlying physical block device (using a key for encryption passed as a secret) as part of node publish, it is expected to continue to do so. In such scenarios, the CSI plugin will be expected to pass the top level logical block device that the FS mount will target through the new `FileSystemMountInfo` in `NodePublishVolumeResponse`, and this will not require access to the secret.
If the secret is necessary for executing the mount, as in the case of SMB/CIFS, then the CSI plugin is expected to resolve the secret and pass it to the runtime as the corresponding mount parameter in `FileSystemMountInfo` in `NodePublishVolumeResponse`. There is a caveat around the fact that the CRI container runtime may persist the OCI spec for the containers on disk - so this may be unsafe until the necessary plumbing happens in the runtimes to pass the OCI spec in memory.
I will call out the above in the KEP. Are there other scenarios and patterns around consumption of CSI secrets beyond the two buckets above?
If we limit this proposal only to block volumes, where the VM does just mount(), then secrets should work fine.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
We (Databricks) have a use case that is blocked on this KEP. Without this, it's really awkward to use block devices with VM-based containers.
Update from sig-node on Feb 8th: the CRUST APIs between Kubelet <=> Runtime Handler will have to be plumbed through CRI APIs. Runtime handler specific actions in the Kubelet are not desired.
@ddebroy This is very interesting. I added an issue on the Kata side to make sure we track this. Are you aware of any similar effort for CNI? I'm asking because Kata has very similar requirements / needs there.
@c3d thanks! I am not aware of a similar effort for CNI. I could be wrong but my understanding is that with containerd/crio, most of the CNI interactions are handled by the CRI runtime. So that could be accomplished without Kubelet changes?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Force-pushed from ccade34 to 31b8bec
A CSI node plugin capable of deferring file system operations should not perform
any mount operations during `NodeStageVolume` and instead perform them during
`NodePublishVolume` based on whether to defer mounts to a container runtime
handler (as detailed below). Such plugins need to implement support for the CSI
API enhancements below.
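As a sketch of that behavior (hypothetical Go types, not the CSI Go bindings), a plugin's publish path might branch on a deferral flag mirroring the proposed `runtime_supports_mount` field and return mount parameters instead of mounting:

```go
package main

import "fmt"

// Hypothetical shapes mirroring the proposed runtime_supports_mount request
// field and FileSystemMountInfo response field.
type nodePublishRequest struct {
	TargetPath           string
	RuntimeSupportsMount bool
	Device               string
	FSType               string
	MountFlags           []string
}

type fileSystemMountInfo struct {
	Source  string
	FSType  string
	Options []string
}

type nodePublishResponse struct {
	MountInfo *fileSystemMountInfo
}

func nodePublishVolume(req nodePublishRequest) nodePublishResponse {
	if req.RuntimeSupportsMount {
		// Defer: return the block device and mount parameters instead of
		// mounting on the host.
		return nodePublishResponse{MountInfo: &fileSystemMountInfo{
			Source:  req.Device,
			FSType:  req.FSType,
			Options: req.MountFlags,
		}}
	}
	// Legacy path: the plugin itself would mount req.Device at req.TargetPath.
	fmt.Printf("mount -t %s %s %s\n", req.FSType, req.Device, req.TargetPath)
	return nodePublishResponse{}
}

func main() {
	resp := nodePublishVolume(nodePublishRequest{
		TargetPath:           "/var/lib/kubelet/pods/example/volumes/mount",
		RuntimeSupportsMount: true,
		Device:               "/dev/vdb",
		FSType:               "xfs",
		MountFlags:           []string{"noatime"},
	})
	fmt.Printf("%+v\n", *resp.MountInfo)
}
```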
This is a pretty big change - most (all?) CSI drivers I have seen mount volumes during NodeStage.
Is there a way to test for this in e2e tests / csi-sanity tests?
Agree. One quick clarification about testing and enforcing: if a CSI plugin wishes to support a file system that is backed by a block device and also supports parallel FS mounts, there won't be any problems (from a functionality perspective) if it wishes to mount the file system in the host during `NodeStageVolume`. Is it still recommended to enforce this with a csi-sanity or a storage e2e test targeting arbitrary storage classes?
To avoid such corruption, ReadWriteOncePod access mode should be used for PVCs
whose associated storage class specifies deferred mounting of file system.
IMO, a PV with the ReadWriteOncePod access mode should be a prerequisite for using this feature. Otherwise kubelet will fall back to the regular mounting and the CRI handler will use virtiofs. As described above, ReadWriteOnce is dangerous.
Since there is no way to enforce ReadWriteOncePod for CSI inline volumes, I would limit this feature to PVs only.
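For reference, a minimal sketch of a PVC built with the ReadWriteOncePod access mode (using the real core/v1 constant); the names and storage class are placeholders, and resource requests are omitted for brevity.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	sc := "deferred-mount-sc" // placeholder StorageClass that opts into deferred mounting
	pvc := corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "data"},
		Spec: corev1.PersistentVolumeClaimSpec{
			// ReadWriteOncePod restricts the volume to a single pod, avoiding
			// the corruption scenario described above.
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOncePod},
			StorageClassName: &sc,
			// Resource requests omitted for brevity.
		},
	}
	fmt.Println(pvc.Name, pvc.Spec.AccessModes)
}
```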
When processing CSI NodePublishVolume, if a CSI plugin finds `defer_fs_ops` is
set, the CSI plugin will invoke `RuntimeStageVolume` on crust_ops which will
persist a JSON file called `mountInfo.json` (in the directory mentioned above)
It would help me understand the document if the interface between kubelet and crust_ops were described first (RuntimeStageVolume, RuntimeUnstageVolume, ...), followed by the detailed implementation of crust_ops (the content of `/var/run/crust`).
Restructured a bit. Let me know if it helps readability now.
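To help make the crust_ops side concrete, here is a sketch, with made-up field names and paths, of how the mount parameters received via `RuntimeStageVolume` could be persisted as `mountInfo.json`; nothing here is the final on-disk format.

```go
package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// mountInfo is an illustrative shape for mountInfo.json; field names are not
// from the KEP.
type mountInfo struct {
	Source  string   `json:"source"`  // block device or network share
	FSType  string   `json:"fsType"`  // e.g. ext4, xfs
	Options []string `json:"options"` // mount options
}

// writeMountInfo persists the mount parameters into the per-volume directory
// that the runtime handler later reads.
func writeMountInfo(volumeDir string, mi mountInfo) error {
	if err := os.MkdirAll(volumeDir, 0o700); err != nil {
		return err
	}
	data, err := json.MarshalIndent(mi, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(volumeDir, "mountInfo.json"), data, 0o600)
}

func main() {
	_ = writeMountInfo("/var/run/crust/example-volume", mountInfo{
		Source:  "/dev/vdb",
		FSType:  "ext4",
		Options: []string{"rw", "noatime"},
	})
}
```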
returns (RuntimeExpandVolumeResponse) {}
}

message RuntimeStageVolumeRequest {
How will this API be versioned? If we add a new field in Kubernetes x.y, how will an old crust_ops / CRI handler interpret it? How will kubelet know that the CRI runtime does not understand the field?
$ cat /var/run/crust/L3Nydi9...L21vdW50/runtime-cli
/bin/kata-runtime
This is filled in by crust_ops from some config file, right?
No - this is populated by the container runtime handler when handling OCI `create` for the containers in the pod.
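A sketch, under the assumption described in this reply, of a runtime handler recording its CLI path into the per-volume crust directory while handling OCI create; the directory name and binary path are illustrative.

```go
package main

import (
	"os"
	"path/filepath"
)

// recordRuntimeCLI writes the runtime's own binary path so that crust_ops can
// later invoke it for deferred filesystem management operations.
func recordRuntimeCLI(crustVolumeDir, runtimeBinary string) error {
	if err := os.MkdirAll(crustVolumeDir, 0o700); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(crustVolumeDir, "runtime-cli"), []byte(runtimeBinary+"\n"), 0o600)
}

func main() {
	_ = recordRuntimeCLI("/var/run/crust/example-volume", "/bin/kata-runtime")
}
```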
expected to interact with crust_ops using a GRPC API over mounted domain
sockets. Container runtimes are expected to implement a well defined protocol
Maybe I missed it below, how will CSI plugins find the socket?
Signed-off-by: Deep Debroy <ddebroy@gmail.com>
/sig node
One question but generally PRR is good.
Rolling back/downgrading a CSI plugin version with respect to its capabilities
(to defer file system mount and management operations to a container runtime
handler) will lead to certain user observable failures/errors (when resizing
This is not ideal but as long as SIG storage is OK with it and it is documented here, I am OK with it from a PRR perspective. Certainly it's not an issue in alpha. It would be helpful to document for users how to identify pods and pod templates that will fail, so they may be cleaned up if necessary.
Is the described behavior only when the CSI plugin is downgraded but the feature gate is still enabled? That is, what is the behavior matrix with CSI plugin incapable/capable and feature gate enabled/disabled?
that might indicate a serious problem?
-->

If number of pods (mounting PVCs) get stuck in Initializing/ContainerCreating
Are there specific, published metrics for these things? Please include their names if so.
/cc @zvonkok
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@ddebroy Are you still working on this?
Hi, I am not at this point - we ran into some significant challenges with a PoC of this at scale.